Important

Due Date: Noon on Thursday 12th of December

Submission: Compile your answers in this RMarkdown file. Upload it, and supporting files, and a pdf rendering into the appropriate spot in Canvas.

Rules: As this is a take home test, there can’t be any restrictions I can enfornce. I do ask that you do your own work. DO NOT give your R to another by email, usb drive, or the like. Do not let people take pictures of your R code.

Time of Last Revision: 8:00 am 4 December, 2019

Example - Bar Chart

I want to begin the final with an example for you. This example will demonstrate several nice features available to us in the {ggpubr} package.

The format of these visualization exercises may include the following:

  1. A data file or (R dataset)
  2. A professional quality visualization addressing the task including a text caption offering some explanation or observation
  3. An explanation of the idiom in the form of “What? Why? How?” table. (see box to follow)
I made this table in PowerPoint with 12pt font. I copied it and pasted as a picture then did a Save As Picture to create an image I could import into this document.

I made this table in PowerPoint with 12pt font. I copied it and pasted as a picture then did a Save As Picture to create an image I could import into this document.

We will call these “What-Why-Hows”

Note: I believe that a good test should teach you new things and not just stress your internal organs into rupture. Hence, I will often tell you to look up and use particular commands from various packages.

> theme_set(theme_pubr() + theme(legend.position = "right"))
> 
> diam_df<- diamonds %>% group_by(cut) %>% dplyr::summarise(counts=n())
> 
> diam_df
## # A tibble: 5 x 2
##   cut       counts
##   <ord>      <int>
## 1 Fair        1610
## 2 Good        4906
## 3 Very Good  12082
## 4 Premium    13791
## 5 Ideal      21551
> diam.bar.p<-ggplot(diam_df, aes(x=cut,y=counts)) + 
+   geom_bar(fill= "blue",stat="identity") +
+   geom_text(aes(label=counts),vjust=-0.3)
> 
> diam.bar.p <- diam.bar.p + labs(title="Distribution of Diamond Inventory",
+                                 subtitle="Plot of the number of diaonds in each cut group",
+                                 caption="Data Source: diamonds dataset",
+                                 x= "Classifcation by Type of Cut",
+                                 y= "Total Amount in Inventory")+ 
+                             scale_fill_brewer(palette="Dark2") 
> 
> 
>   
> text.bar<-paste("This chart shows stock increases dramatically for the higher quality cuts.  This is fortunate as we can expect a higher profit margin for the Premium and Ideal cuts.")
> 
> text.bar.p<-ggparagraph(text=text.bar,face="italic",size=11,color="black")
> 
> ggarrange(diam.bar.p,text.bar.p,ncol=1,nrow=2,heights=c(3.5,0.5))+
+   theme_pubclean()

Texas Housing

Explore the dataset “txhousing” in {ggplot2}.

“Information about the housing market in Texas provided by the TAMU real estate center, http://recenter.tamu.edu/.”

What-Why-How 1: Histogram

Include a What-Why-How for the histogram idiom.

Plot 1: Histogram

Create a quality histogram of the median price of the houses being sold. Make the x-axis show currency and not in scientific notation. Give it a reasonable title.

Plot 2: Density

Building upon the previous chart, overlay an opaque density on the histogram.

Graduate Students Only Plot 1: Visual investigation of inventory problem

Create to subgroups. The A subgroup will be the subset of housing residing in cities beginning with the letter “A”. Create an “S” subgroup as well. On the same plot, over the the densities. In a text paragraph beneath the histograms, discuss if there seems to be any real difference in the probabilities of selling homes in those cities.

Answer:

What-Why-How 1: Histogram

Texashousing

Texashousing

> #### Plot 1: Histogram ####
> txhousing_print<- print(txhousing)
## # A tibble: 8,602 x 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # ... with 8,592 more rows
> txhousing_summary<- summary(txhousing)
> txhousing_df<-txhousing %>% 
+   group_by(median) %>%
+   group_by(month) %>%
+   dplyr::summarise(counts=n())
> txhousing_df
## # A tibble: 12 x 2
##    month counts
##    <int>  <int>
##  1     1    736
##  2     2    736
##  3     3    736
##  4     4    736
##  5     5    736
##  6     6    736
##  7     7    736
##  8     8    690
##  9     9    690
## 10    10    690
## 11    11    690
## 12    12    690
> txhousing.hist<-ggplot(txhousing, aes(x = median)) +
+   geom_histogram(fill= "blue")
> 
> txhousing.hist <- txhousing.hist + labs(title="Median price of the houses being sold",
+                                 subtitle="Plot of the number of houses sold and price",
+                                 caption="Data Source: Texas Housing",
+                                 x= "Median",
+                                 y= "Count")+ 
+                             scale_fill_brewer(palette="Dark2")
> 
> txhousing.hist

> #### Plot 2: Density ####
> txhousing.density<-ggplot(data = txhousing, aes(x = median)) +
+   geom_density()
> txhousing.density

> ####Graduate Students Only Plot 1: Visual investigation of inventory problem ####
> p <- ggplot(txhousing, aes(date, median)) +
+   geom_line(aes(group = city), alpha = 0.2)
> p

> models <- txhousing %>% 
+   group_by(city) %>%
+   do(mod = lm(
+     log2(sales) ~ factor(month), 
+     data = ., 
+     na.action = na.exclude
+   ))
> models
## Source: local data frame [46 x 2]
## Groups: <by row>
## 
## # A tibble: 46 x 2
##    city                  mod   
##  * <chr>                 <list>
##  1 Abilene               <lm>  
##  2 Amarillo              <lm>  
##  3 Arlington             <lm>  
##  4 Austin                <lm>  
##  5 Bay Area              <lm>  
##  6 Beaumont              <lm>  
##  7 Brazoria County       <lm>  
##  8 Brownsville           <lm>  
##  9 Bryan-College Station <lm>  
## 10 Collin County         <lm>  
## # ... with 36 more rows
> model_sum <- models %>% glance(mod)
> model_sum
## # A tibble: 46 x 12
## # Groups:   city [46]
##    city                  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC deviance df.residual
##    <chr>                     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
##  1 Abilene                   0.530         0.500 0.282     17.9  1.50e-23    12  -22.2  70.5 112.      13.9         175
##  2 Amarillo                  0.449         0.415 0.302     13.0  7.41e-18    12  -35.0  95.9 138.      15.9         175
##  3 Arlington                 0.513         0.483 0.267     16.8  2.75e-22    12  -12.5  50.9  92.9     12.5         175
##  4 Austin                    0.487         0.455 0.310     15.1  2.04e-20    12  -40.3 107.  149.      16.8         175
##  5 Bay Area                  0.555         0.527 0.265     19.9  1.45e-25    12  -10.5  47.0  89.0     12.3         175
##  6 Beaumont                  0.430         0.395 0.275     12.0  1.18e-16    12  -18.0  62.1 104.      13.3         175
##  7 Brazoria County           0.508         0.475 0.292     15.1  6.48e-20    12  -26.3  78.5 119.      13.7         161
##  8 Brownsville               0.171         0.119 0.420      3.25 4.59e- 4    12  -95.7 217.  259.      30.5         173
##  9 Bryan-College Station     0.651         0.629 0.406     29.7  1.73e-34    12  -90.6 207.  249.      28.9         175
## 10 Collin County             0.601         0.576 0.266     24.0  1.56e-29    12  -11.4  48.8  90.8     12.4         175
## # ... with 36 more rows

Colorblind Flowers?

In this problem set, you will create scatterplots without and with grouping, and finally with regression analysis. We will use the ever popular iris dataset. We place the restriction on the problem that the visualizations should be colorblindness friendly.

For your colors, you are to use the {viridis} R package. It uses beautiful colors that are printer-friendly and uniform, and easy to use.

Use the minimal theme from {ggpubr}

What-Why-How 2: 2-d scatterplot

Insert the visual idiom description for a 2-d scatterplot.

Plot 3: Scatterplot

Using a minimal theme, create a scatterplot of Sepal.Length against Sepal.Width. Create visual grouping of the species suitable for colorblind people. Overlay a regression line with standard error bands. Make sure the axises have physical units listed.

Plot 4: Type of Dot Plot

It would seem to me another approach to aiding those whose are colorblind is to use changes in size of the dots. So for this plot, retain the coloration from the viridis palette, but also change the size of the circle for each species. That is there will be three size circles, each of a constant colorblind-friendly color. Make sure the legend makes clear the two-fold way of visualizing the grouping. A text sentence might be helpful.

Plot 5: Facetting

Yet another way to help is to seperate the scatterplots. For this plot, facet three scatterplots in a row. One scatterplot per species. For this one, include the regression line and error bands around each subplot.

Plot 6: Grouping

We could also highlight the groups for the colorblind (or really any) viewer by putting an enclosing figure about clusters of points, such as an ellipse, around clusters of points. There are many ways and means of doing this, but try the stat_ellipse() function.

Graduate Students Only Plot 2: Edge Densities

Explore the ggscatterhist() plot in {ggpubr}. Create a scatterplot as above without the regression line and error bands. Colorize the groups by species using colorblind safe colors. Finally have the upper edge and right edge present density histrographs to reflect the marginal distributions.

Answer:

What-Why-How 2: 2-d scatterplot

> library(ggplot2)
> #### Plot 3: Scatterplot ####
> head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
> ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) + geom_point()

> pl <- iris$Petal.Length
> sl <- iris$Sepal.Length
> model <- lm(pl ~ sl)
> model
## 
## Call:
## lm(formula = pl ~ sl)
## 
## Coefficients:
## (Intercept)           sl  
##      -7.101        1.858
> pred <- predict(model, data.frame(sl = sort(sl)), level = 0.95, interval = "confidence")
> head(pred)
##         fit       lwr      upr
## 1 0.8898184 0.5928870 1.186750
## 2 1.0756617 0.7935781 1.357745
## 3 1.0756617 0.7935781 1.357745
## 4 1.0756617 0.7935781 1.357745
## 5 1.2615050 0.9940171 1.528993
## 6 1.4473483 1.1941605 1.700536
> lower <- pred[,2]
> upper <- pred[,3]
> 
> # make plot 
> plot(sl, pl, pch=19, type="n")
> grid()
> polygon(c(sort(sl), rev(sort(sl))), c(upper, rev(lower)), col = "gray", border = "gray")
> points(sl, pl, pch=19, cex=0.6)
> abline(model, col="red", lwd=2)

> #### Plot 4: Type of Dot Plot ####
> pal <- c("red", "blue", "green")
> pal <- setNames(pal, c("virginica", "setosa", "versicolor"))
> 
> p <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length, color = ~Species, colors = pal,size=30)
> p

Figure 2: Write the name of this visualization in the space provided using red ink.

> #### Plot 5: Facetting ####
> ggplot(iris, aes(x=Sepal.Length, y=Petal.Length)) + geom_point() + facet_wrap(~Species)

> facet <- ggplot(data=iris, aes(Sepal.Length, y=Sepal.Width, color=Species)) + 
+   geom_point(aes(shape=Species), size=1.5) + geom_smooth(method="lm") +
+   xlab("Sepal Length") + ylab("Sepal Width") + ggtitle("Faceting")
> 
> # Along rows
> facet + facet_grid(. ~ Species)

> # Along columns
> facet + facet_grid(Species ~ .)

> #### Plot 6: Grouping ####
> ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Species)) +
+    geom_point() +
+    stat_ellipse()

> #### Graduate Students Only Plot 2: Edge Densities ####
> # Basic scatter plot with marginal density plot
> ggscatterhist(iris, x = "Sepal.Length", y = "Sepal.Width",
+               color = "#00AFBB",
+               margin.params = list(fill = "lightgray"))

> # Grouped data
> ggscatterhist(
+  iris, x = "Sepal.Length", y = "Sepal.Width",
+  color = "Species", size = 3, alpha = 0.6,
+  palette = c("#00AFBB", "#E7B800", "#FC4E07"),
+  margin.params = list(fill = "Species", color = "black", size = 0.2)
+ )

> # Use boxplot as marginal
> ggscatterhist(
+  iris, x = "Sepal.Length", y = "Sepal.Width",
+  color = "Species", size = 3, alpha = 0.6,
+  palette = c("#00AFBB", "#E7B800", "#FC4E07"),
+  margin.plot = "boxplot",
+  ggtheme = theme_bw()
+ )

A data item of Many Hats

A data item consisting of a quantitative value such as a count and two categorical attributes is simple enough to wear at least a dozen different visual idioms.

Plot 7: Mystery Idiom

In the R package {vcdExtra} there is a dataset called GSS.

General Social Survey– Sex and Party affiliation Description Data from the General Social Survey, 1991, on the relation between sex and party affiliation.

To prepare us in reading the visualization, present the head and structure of the file. Once you have identified this visual idiom give a quality visualizations with title, and descriptive axes. Use colors from the correct type of Brewer color palette. Report the totals in the vis gylph. [Hint: {ggpubr} has a nice wrapper. ]

Plot 8 - Playing Dodge ball

Repeat the above but dodge the bars rather than stack them.

Answer:

> #### A data item of Many Hats//Plot 7: Mystery Idiom//Plot 8 - Playing Dodge ball ####
> library(vcdExtra)
> library(ggplot2)
> head(GSS)
##      sex party count
## 1 female   dem   279
## 2   male   dem   165
## 3 female indep    73
## 4   male indep    47
## 5 female   rep   225
## 6   male   rep   191
> structure(GSS)
##      sex party count
## 1 female   dem   279
## 2   male   dem   165
## 3 female indep    73
## 4   male indep    47
## 5 female   rep   225
## 6   male   rep   191
> data(GSS)
> plot(GSS)

> (GSStab <- xtabs(count ~ sex + party, data=GSS))
##         party
## sex      dem indep rep
##   female 279    73 225
##   male   165    47 191
> GSStab
##         party
## sex      dem indep rep
##   female 279    73 225
##   male   165    47 191
> mod.glm <- glm(count ~ sex + party, family = poisson, data = GSS)
> mod.glm
## 
## Call:  glm(formula = count ~ sex + party, family = poisson, data = GSS)
## 
## Coefficients:
## (Intercept)      sexmale   partyindep     partyrep  
##     5.56611     -0.35891     -1.30833     -0.06514  
## 
## Degrees of Freedom: 5 Total (i.e. Null);  2 Residual
## Null Deviance:       271.4 
## Residual Deviance: 7.003     AIC: 55.59
> GSS <- data.frame(
+   expand.grid(sex=c("female", "male"), 
+               party=c("dem", "indep", "rep")),
+   count=c(279,165,73,47,225,191))
> GSS
##      sex party count
## 1 female   dem   279
## 2   male   dem   165
## 3 female indep    73
## 4   male indep    47
## 5 female   rep   225
## 6   male   rep   191
> names(GSS)
## [1] "sex"   "party" "count"
> str(GSS)
## 'data.frame':    6 obs. of  3 variables:
##  $ sex  : Factor w/ 2 levels "female","male": 1 2 1 2 1 2
##  $ party: Factor w/ 3 levels "dem","indep",..: 1 1 2 2 3 3
##  $ count: num  279 165 73 47 225 191
> sum(GSS$count)
## [1] 980
> ggplot(GSS, aes(x=sex, y=count)) + geom_point()

#### Plot 9 geom_count {ggplot2} has introduced a new geom called “geom_count”. It is a mixture of dot plot and tile plot. See the examples in https://ggplot2.tidyverse.org/reference/geom_count.html.

Use the diamond data set to create a visualization where there is a disk present for each combination of “color” (color as in the property of the gem), and “clarity”. The size of the disk reflects the total proportion of the diamonds in this joint-category scaled by the number of diamonds in the entire set.

Visualize the data items described above. Intoduce a title and subtitle. Improve the x and y axis and see what you can do to the legend. Include a caption. Finally add a small paragraph of text explaining to the owners what patterns are seen to exist in their inventory visualization.

Pick an appropriate publication theme.

What-Why-How 3: geom_count

Insert the visual idiom description for geom_count.

Answer:

> #### Plot 9 geom_count ####
> str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
> #Now, there are three parts to a ggplot2 graph. The first is the data we'll be graphing. In this case, we are plotting the diamonds data frame, so we type "diamonds". Second, we show the mapping of aesthetics to the attributes we'll be plotting. We type aes- meaning aesthetics- then open parentheses, and now our assignments: we say "x=carat", saying we want to put carat on the x-axis, then "y=price", saying what we want to put on the y-axis.
> ggplot(diamonds, aes(x=carat, y=price)) + geom_count()

> #Now every point is colored according to the quality of the clarity of each diamond. Notice that it created a legend on the right side. You can see that some of the lighter diamonds are more expensive if they have a high clarity rating, and conversely that some of the heavier diamonds aren't as expensive for having a low clarity rating. This is what leads to this rainbow pattern.
> ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_count()+
+ labs(title="Distribution of Diamond Inventory",
+                                 subtitle="Plot of the clarity rating on price",
+                                 caption="Data Source: diamonds dataset",
+                                 x= "Carat",
+                                 y= "Price")+ 
+                             scale_fill_brewer(palette="Dark2")

> #If we would rather see how the quality of the color or cut of the diamond affects the price, we can change the aesthetic. Here in "aes" we change "clarity" to "color".
> ggplot(diamonds, aes(x=carat, y=price, color=color)) + geom_count()

> #Now every item in the color legend is one of the ratings of color. Or we can change it to "cut":
> ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_count()

> #Now, what if we want to see the effect of both color and cut? We can use a fourth aesthetic, such as the size of the points. So here we have color representing the clarity. Let's add another aesthetic- let's say "size=cut."
> ggplot(diamonds, aes(x=carat, y=price, color=clarity, size=cut)) + geom_count()

> #Now the size of every point is determined by the cut even while the color is still determined by the clarity. Similarly, we could use the shape to represent the cut:
> ggplot(diamonds, aes(x=carat, y=price, color=clarity, shape=cut)) + geom_count()

> #if we want to add a smoothing curve that shows the general trend of the data
> ggplot(diamonds, aes(x=carat, y=price)) + geom_count() + geom_smooth()

> #Similarly, if we would rather show a best fit straight line rather than a curve, we can change the "method" option in the geom_smooth layer.
> ggplot(diamonds, aes(x=carat, y=price)) + geom_count() + geom_smooth(se=FALSE, method="lm")

### What-Why-How 3: geom_count ###

Graduate Students Only Plot 3: {vcd}

This is a type of research question. I’ve given you the paper called “The Structplot Framework: Multi-way Continegency Tables with vcd.” I want you to research the presentation of the the simplest contingency table (2 categorical variables, and one frequence or count variable). I give you three datasets to work with: GSS, Mental, and Glass. I want you to give me back three completely different visualizations. Page 8 of the paper lists several options for you to explore. Make them look good so the differences in design can be readily seen.

Answer:

> plot(GSS)

> GSStab <- xtabs(count ~ sex + party, data=GSS)
> # using the data in table form
> mod.glm1 <- glm(Freq ~ sex + party, family = poisson, data = GSStab)
> res <- residuals(mod.glm1)
> std <- rstandard(mod.glm1)
> # For mosaic.default(), need to re-shape residuals to conform to data
> stdtab <- array(std, dim=dim(GSStab), dimnames=dimnames(GSStab))
> mosaic(GSStab, gp=shading_Friendly, residuals=stdtab, residuals_type="Std\nresiduals",
+ labeling = labeling_residuals)

> # Using externally calculated residuals with the glm() object
> mosaic.glm(mod.glm1, residuals=std, labeling = labeling_residuals, shade=TRUE)

> # Using residuals_type
> mosaic.glm(mod.glm1, residuals_type="rstandard", labeling = labeling_residuals, shade=TRUE)

> plot(Mental)

> str(Mental)
## 'data.frame':    24 obs. of  3 variables:
##  $ ses   : Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ mental: Ord.factor w/ 4 levels "Well"<"Mild"<..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Freq  : int  64 94 58 46 57 94 54 40 57 105 ...
> (Mental.tab <- xtabs(Freq ~ ses+mental, data=Mental))
##    mental
## ses Well Mild Moderate Impaired
##   1   64   94       58       46
##   2   57   94       54       40
##   3   57  105       65       60
##   4   72  141       77       94
##   5   36   97       54       78
##   6   21   71       54       71
> # mosaic and sieve plots
> mosaic(Mental.tab, gp=shading_Friendly)

> sieve(Mental.tab, gp=shading_Friendly)

> library(ca)
> plot(ca(Mental.tab), main="Mental impairment & SES")
> title(xlab="Dim 1", ylab="Dim 2")

> plot(Glass)

> glass.tab <- xtabs(Freq ~ father + son, data=Glass)
> 
> largs <- list(set_varnames=list(father="Father's Occupation", son="Son's Occupation"),
+               abbreviate=10)
> gargs <- list(interpolate=c(1,2,4,8))
> mosaic(glass.tab, shade=TRUE, labeling_args=largs, gp_args=gargs,
+   main="Alphabetic order", legend=FALSE, rot_labels=c(20,90,0,70))

> # reorder by status
> ord <- c(2, 1, 4, 3, 5) 
> mosaic(glass.tab[ord, ord], shade=TRUE, labeling_args=largs,  gp_args=gargs,
+   main="Effect order", legend=FALSE, rot_labels=c(20,90,0,70))

Honest Maps

What-Why-How 4: Geom-Map

Insert the visual idiom description for {ggplot2} layer geom_map. In particular, focus on applications using geographic information such as the United States.

Answer:

Plot 10 - Maps

This problem gives you the solution, sort of. Start with the example provided in the provided at the bottom of the listing. I repeat it for your convinence and reference.

> # Better example
> library(maps)
> crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)
> crimesm <- reshape2::melt(crimes, id = 1)
> 
> if (require(maps)) {
+   states_map <- map_data("state")
+   ggplot(crimes, aes(map_id = state)) +
+     geom_map(aes(fill = Murder), map = states_map) +
+     expand_limits(x = states_map$long, y = states_map$lat)
+ 
+   last_plot() + coord_map()
+   ggplot(crimesm, aes(map_id = state)) +
+     geom_map(aes(fill = value), map = states_map) +
+     expand_limits(x = states_map$long, y = states_map$lat) +
+     facet_wrap( ~ variable)
+ }

Your job. Make this professional. Also you must use the Brewer color palette.

3 types

3 types

I think it is very difficult to distinguish between the blues. Why does the x axis have negative numbers? What do the y values mean. What is “Urban Pop”? Give it a title, subtitles and captions. Maybe a paragraph explaining. What is “value” on the legend.

This was an impressive visualization until we tried to use it for something.

Answer:

> crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)
> crimesm <- reshape2::melt(crimes, id = 1)
> 
> a<-if (require(maps)) {
+   states_map <- map_data("state")
+   ggplot(crimes, aes(map_id = state)) +
+     geom_map(aes(fill = Murder), map = states_map) +
+     scale_fill_gradientn(colours=c("blue","green","yellow","red"))+
+     expand_limits(x = states_map$long, y = states_map$lat)
+ 
+  last_plot() + coord_map()
+   ggplot(crimesm, aes(map_id = state)) +
+     geom_map(aes(fill = value), map = states_map) +
+     scale_fill_gradientn(colours=c("blue","green","yellow","red"))+
+     expand_limits(x = states_map$long, y = states_map$lat) +
+     facet_wrap( ~ variable)
+ }
> 
> a <- a + labs(title="Violent Crime Rates by US State",
+                                 subtitle="Arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states",
+                                 caption="USArrests",
+                                 x= "long",
+                                 y= "lat")
> a

State Plane and Universal Transverse MercatorAs noted previously, appropriate map projections have been adopted for each State,yielding “Earth” projections with coordinates based on latitude and longitude,the universal reference system. But these“latitude/longitude” references, as they will be referredto, are quite cumbersome,given that they are in degrees, minutes,and seconds.10 Two principal alternative coordinate systems are found in addition to latitude/longitude: the State Plane Coordinate System and the Universal Transverse Mercator (UTM).The State Plane Coordinate System was devised for greater user convenience, with a rectangular grid superimposed over the latitude/longitude graticule, producing State plane coordinates expressed in meters, yards, or feet. In effect, this system assumes that the individual States are flat so they can be described by plane geometry rather than the spherical grid. For local applications, this use of plane geometry is acceptable because error due to failure to take Earth curvature into account is not significant over relatively small areas such as police jurisdictions. Large States are divided into zones with separate grids for each to avoid the distortion problem. Texas, for example, is divided into the North, North-Central, Central, South-Central, and South zones; Louisiana into North, South, and Coastal. Typically, the origin, or zero point, for a State plane system is placed in the southwest corner, to avoid the inconvenient possibility of having to express coordinates in negative numbers. The origin is also placed outside the study area for the same reason.

The following graph displays the percentage of a state’s population that is classified as “Urban Population”. This can be used as a reference when comparing Urban Population percentages between states. We will consider how the arrest rates of a state are related to the Urban Population percentage of each state.

Essentially what’s happening here is that the map data (here called states) includes the latitute and longitude coordinates for the boundaries of each state. Specifying geom = “polygon” in the qplot function results in the states being traced out based on the lat/long coordinates. Urban Population of a state is related to the rate of arrests for Assault, Rape, Murder, and the Total Arrest Rates among states.

The colour is mapped red as high then yellow, then green and then blue.that light blue indicates areas with high murder rates while dark blue indicates areas with low murder rates

Graduate Students Only Plot 4: Scaling Data

As mentioned in class, many maps showing a single attribute held by people across the US, are usually just a popultation density map. Using what numbers you can find, scale the data so you have “number of murders per capita” and not just number of murders. (I might be wrong and this data already shows this, but I have my doubts. That it is difficult to distinguish the blues doesn’t help and the legend is meaningless doesn’t help.)

Answer:

Summary of USArrests The USArrests dataset is from 1975, showing arrests per 100,000 residents for assault, murder, and rape in each of the fifty states. Additionally, the percentage of the population living in urban areas is given.

One question is to whether the percentage of urban population in a state is related to the arrest rates of that state. Eventually, we will look at the correlations between population and arrest rates, in addition to focusing on the states with the highest arrest ragtes per category.

> summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

Highest Arrest Rates The 3rd Quartiles for each variable are diplayed in the previous summary. The values above the third quartile make up the top 25% of the data. The following states have arrest rates above the 3rd quartiles in:

All Three Categories

> USArrests$States <- row.names(USArrests)
> USArrests$total <- (USArrests$Murder + USArrests$Assault + USArrests$Rape)
> USArrests <- USArrests[, c(5,1,2,3,4,6)]
> arrange(filter(USArrests, USArrests$Murder >= 11.25 & USArrests$Assault >= 249 & USArrests$Rape >= 26.18), desc(total))
##       States Murder Assault UrbanPop Rape total
## 1    Florida   15.4     335       80 31.9 382.3
## 2   Maryland   11.3     300       67 27.8 339.1
## 3 New Mexico   11.4     285       70 32.1 328.5
## 4     Nevada   12.2     252       81 46.0 310.2
## 5   Michigan   12.1     255       74 35.1 302.2

Observations If a state had one type of arrest rate above the 3rd Quartile, then it was likely to have multiple types of arrest rate above the 3rd Quartile. Only 5 states had all three categories (Murder, Assault, and Rape) in the top 25 percentile.

Urban Population Percentage The following graph displays the percentage of a state’s population that is classified as “Urban Population”. This can be used as a reference when comparing Urban Population percentages between states. As we go forward, we will consider how the arrest rates of a state are related to the Urban Population percentage of each state.

> library(ggplot2)
> library(maps)
> library(mapdata)
> usa <- map_data("usa")
> states = map_data("state")
> arrestrate = USArrests
> names(arrestrate) = tolower(names(arrestrate))
> arrestrate$region = tolower(rownames(USArrests))
> arrestrate_map = merge(states, arrestrate, sort = FALSE, by = "region")
> arrestrate_map = arrestrate_map[order(arrestrate_map$order), ]
> ggplot(arrestrate_map, aes(x=long,y=lat,group=group))+
+   geom_polygon(aes(fill=urbanpop))+
+   geom_path()+ 
+   scale_fill_gradientn(colours=rainbow(2),na.value="red90")+
+   coord_map()

Considering Correlations The following scatterplots show how the Urban Population of a state is related to the rate of arrests for Assault, Rape, Murder, and the Total Arrest Rates among states.

> par(mfrow = c(2,2))
> plot(USArrests$UrbanPop, USArrests$Assault, xlab = 'Urban Population \n', ylab = "Assault Rate", sub = 'Correlation Coefficient = 0.2588717', main = "Assault vs. Urban Population")
> fitline<- lm(Assault ~ UrbanPop, data = USArrests)
> abline(fitline)
> plot(USArrests$UrbanPop, USArrests$Rape, xlab = 'Urban Population \n', ylab = 'Rape', sub = 'Correlation Coefficient = 0.4113412', main = "Rape vs. Urban Population")
> fitline<- lm(Rape ~ UrbanPop, data = USArrests)
> abline(fitline)
> plot(USArrests$UrbanPop, USArrests$Murder, xlab = 'Urban Population \n', ylab = 'Murder' , sub = 'Correlation Coefficient = 0.0695726',main = "Murder vs. Urban Population")
> fitline<- lm(Murder ~ UrbanPop, data = USArrests)
> abline(fitline)
> plot(USArrests$UrbanPop, USArrests$total, xlab = 'Urban Population \n', ylab = 'Total Arrest Rate' , sub = 'Correlation Coefficient = 0.2755569',main = "Total Arrest Rate vs. Urban Population")
> fitline<- lm(total ~ UrbanPop, data = USArrests)
> abline(fitline)

Summary of Plots The plots were used to look for a correlation between the Urban Population percentage per state and the number of arrests per state. The Correlation Coefficient is a number between -1 and 1, which shows how two variables are related. A value close to 1 or -1 indicates a close relationship between the two variables. A value close to 0 indicates a weak relationship. The largest correlation was between Urban Population and Rape, as seen in the graph. The correlation coefficient was not indicative of an extremely strong relationship, but visually we can see that as Urban Population percentage increased, so did the occurrence of Rape in each state. Additionally, there is a slight positive relationship between Urban Population and every other variable, including the Total Arrest Rate. Therefore, as a State’s urban population increases, so do the rates of arrest.

Top 5 highest arrest rates of: Murders

> x<-arrange(filter(USArrests, USArrests$Murder >= 14.4), desc(Murder))
> x
##           States Murder Assault UrbanPop Rape total
## 1        Georgia   17.4     211       60 25.8 254.2
## 2    Mississippi   16.1     259       44 17.1 292.2
## 3        Florida   15.4     335       80 31.9 382.3
## 4      Louisiana   15.4     249       66 22.2 286.6
## 5 South Carolina   14.4     279       48 22.5 315.9

Final Summary In the end, there was only a slight relationship between what percentage of a state was urban and their arrest rates. There is almost a close correlation between the urban population and total arrest rates of the states with the five highest total arrest rates. However, North Carolina was a big outlier in terms of population percentage. Although one may assume that higher urban population percentages would be associated with higher arrest rates, that is not necessarily the case.